The statistical regularities in language corpora encode well-known social
biases into word embeddings. Here, we focus on gender to provide a
comprehensive analysis of group-based biases in widely-used static English word
embeddings trained on internet corpora (GloVe 2014, fastText 2017). Using the
Single-Category Word Embedding Association Test, we demonstrate the widespread
prevalence of gender biases that also show differences in: (a) frequencies of
words associated with men versus women; (b) part-of-speech tags in
gender-associated words; (c) semantic categories in gender-associated words;
and (d) valence, arousal, and dominance in gender-associated words.
First, in terms of word frequency: we find that, of the 1,000 most frequent
words in the vocabulary, 77% are more associated with men than women, providing
direct evidence of a masculine default in the everyday language of the
English-speaking world. Second, turning to parts-of-speech: the top
male-associated words are typically verbs (e.g., fight, overpower) while the
top female-associated words are typically adjectives and adverbs (e.g., giving,
emotionally). Gender biases in embeddings also permeate parts-of-speech. Third,
for semantic categories: we conduct bottom-up cluster analyses of the top 1,000
words associated with each gender. The top male-associated concepts include roles and
domains of big tech, engineering, religion, sports, and violence; in contrast,
the top female-associated concepts are less focused on roles, including,
instead, female-specific slurs and sexual content, as well as appearance and
kitchen terms. Fourth, using human ratings of word valence, arousal, and
dominance from a ~20,000-word lexicon, we find that male-associated words are
higher on arousal and dominance, while female-associated words are higher on
valence.
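
As a rough illustration of the kind of score the Single-Category Word Embedding Association Test produces, the sketch below computes an effect size for one target word against two attribute sets. It assumes the commonly used formulation (difference in mean cosine similarity, divided by the sample standard deviation of similarities over both attribute sets); the function names, the toy two-dimensional vectors, and the choice of `ddof=1` are illustrative assumptions, not details taken from this paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sc_weat_effect_size(w, attrs_a, attrs_b):
    """Effect size of target vector w's association with set A versus set B.

    Assumed formulation: (mean cos(w, a) - mean cos(w, b)) divided by the
    sample standard deviation of cos(w, x) over all x in A and B.
    Positive values mean w sits closer to A; negative values, closer to B.
    """
    sims_a = np.array([cosine(w, a) for a in attrs_a])
    sims_b = np.array([cosine(w, b) for b in attrs_b])
    pooled = np.concatenate([sims_a, sims_b])
    return float((sims_a.mean() - sims_b.mean()) / pooled.std(ddof=1))

# Toy vectors standing in for real embeddings (hypothetical values).
male_attrs = [[1.0, 0.0], [0.9, 0.1]]
female_attrs = [[0.0, 1.0], [0.1, 0.9]]
target = [1.0, 0.2]  # points mostly along the "male" direction

score = sc_weat_effect_size(target, male_attrs, female_attrs)
flipped = sc_weat_effect_size(target, female_attrs, male_attrs)
print(score > 0, flipped < 0)
```

In practice the attribute sets would be lists of embedding vectors for male and female attribute words, and the score would be computed for every word in the vocabulary; swapping the two attribute sets flips the sign of the score.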